Abstract


Stability conditions can be thought of as a way of controlling the variance
of the learning process. Strong stability conditions additionally imply
concentration of certain quantities around their expected values. It was shown
recently that stability of learning algorithms is closely related to their
generalization and consistency. In this paper we examine stability conditions
from this point of view, complementing the results of [6, 5].
1 Introduction


This paper is motivated by the results of [5] (see also [6]). Mukherjee et al.
[5], building on work by [2] and by [1], showed that a certain form of
double-sided cross-validation leave-one-out stability is not only necessary and
sufficient for generalization and consistency of ERM but is also sufficient,
when expected and empirical leave-one-out stability hold, for generalization of
any symmetric learning algorithm. In this paper, we describe a few additional
results that will hopefully better illuminate the role of stability in
generalization. We work in the change-one framework instead of the leave-one
framework. We show that a weak form of stability called pseudostability (see
[5], definition 3.9 for the leave-one-out case) is not only necessary and
sufficient for generalization and consistency of ERM algorithms but is also
sufficient for generalization, if expected and empirical change-one-out
stability hold with sufficiently fast rates. We also show that by using a
stronger definition of CV stability than [5] we are able to ensure
generalization by using only expected stability, without empirical stability.


2 Extending McDiarmid's Inequality

McDiarmid's inequality has been used in the past few years to obtain
concentration results from stability conditions. These stability conditions can
be thought of as Lipschitz conditions on the map from sets to functions (i.e. a
change in the training set does not affect the output function by more than
$\beta$ = Lipschitz constant). When the Lipschitz constant can be shown to be
decreasing in the number of points faster than $O(1/\sqrt{n})$, concentration
results follow from McDiarmid's inequality.

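Theorem 2.1 (McDiarmid, [4]) Let $\Omega_1, \ldots, \Omega_n$ be probability
spaces. Let $\Omega = \prod_{k=1}^{n} \Omega_k$ and let $X$ be a random variable
on $\Omega$ which is uniformly difference-bounded by $\beta_n$ (i.e. for any
$k$, if $\omega, \omega' \in \Omega$ differ only in the $k$-th coordinate, then
$|X(\omega) - X(\omega')| \le \beta_n$). Then for any $\epsilon > 0$,

$$P(X - \mathbb{E}X > \epsilon) \le \exp\left(\frac{-2\epsilon^2}{n\beta_n^2}\right)$$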
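
As a rough numerical illustration of the rate requirement, the sketch below evaluates the standard form of McDiarmid's bound, $\exp(-2\epsilon^2/(n\beta_n^2))$, for two choices of $\beta_n$. The function name `mcdiarmid_bound` and the sample values of $\epsilon$ and $n$ are illustrative assumptions, not taken from the paper.

```python
import math


def mcdiarmid_bound(eps: float, n: int, beta: float) -> float:
    """Standard McDiarmid tail bound exp(-2 eps^2 / (n beta^2)).

    Assumes every coordinate has the same difference bound `beta`.
    """
    return math.exp(-2.0 * eps ** 2 / (n * beta ** 2))


if __name__ == "__main__":
    eps = 0.1  # illustrative deviation level
    for n in (100, 1_000, 10_000):
        fast = mcdiarmid_bound(eps, n, 1.0 / n)             # beta_n = 1/n
        slow = mcdiarmid_bound(eps, n, 1.0 / math.sqrt(n))  # beta_n = 1/sqrt(n)
        print(f"n={n:6d}  beta=1/n: {fast:.3e}   beta=1/sqrt(n): {slow:.3e}")
```

With $\beta_n = 1/n$ the bound equals $\exp(-2n\epsilon^2)$ and shrinks with $n$; with $\beta_n = 1/\sqrt{n}$ it stays at $\exp(-2\epsilon^2)$ for every $n$, which is why the Lipschitz constant has to decay faster than $1/\sqrt{n}$.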

Kutin and Niyogi [2] extended McDiarmid's inequality to include a possibility
of a bad event:

Theorem 2.2 (Kutin, Niyogi) Let $X$ be a random variable ($|X| \le 1$) on
$\Omega$ which is strongly difference-bounded by $(\lambda/n, \exp(-Kn))$ (i.e.
there is a bad subset $B \subset \Omega$ of measure $\exp(-Kn)$ s.t. for any
$k$, if $\omega, \omega' \in \Omega$ differ only in the $k$-th coordinate, then
$|X(\omega) - X(\omega')|$ is bounded by $\lambda/n$ if $\omega \notin B$ and by
1 otherwise). Then for any $0 < \epsilon \le 2\lambda\sqrt{K}$ and
$n \ge \max\{\ldots,\ 3(\ldots + 3)\ln(\ldots + 3)\}$,

$$P(X - \mathbb{E}X > \epsilon) \le \exp(\ldots)$$
This theorem is a special case of a more general theorem proved by Kutin and
Niyogi [2]:

Theorem 2.3 (Kutin, Niyogi) Let $X$ be a random variable ($|X| \le 1$) on
$\Omega$ which is strongly difference-bounded by $(\beta_n, \delta_n)$,
$1 > \beta_n > 0$. Then for any $\epsilon > 0$,

$$P(X - \mathbb{E}X > \epsilon) \le 2\left(\exp\left(\frac{-\epsilon^2}{8n\beta_n^2}\right) + \frac{n\delta_n}{\beta_n}\right)$$


For the above bound to decrease with $n$, $\beta_n$ has to decrease faster than
$O(1/\sqrt{n})$. Additionally, $\delta_n$ has to decrease faster than
$\beta_n/n$, i.e. faster than $O(n^{-3/2})$.

We now give an example of a random variable which is strongly
difference-bounded by $(0, O(1/\sqrt{n}))$ but is not concentrated:


Example Let $\omega = (\omega_1, \ldots, \omega_n) \in [0,1]^n$. Let
$X(\omega) = 0$ if the number of $\omega_i$'s greater than $1/2$ is larger than
$[n/2]$ and $X(\omega) = 1$ otherwise. In other words, $X$ takes values 0 or 1
depending on whether the majority of the points falls to the left or to the
right of $1/2$. Note that a change of one point does not change the value of
$X$ unless the set $\omega$ of points is balanced. The probability of this
event is $O(1/\sqrt{n})$. Even though the measure of the "bad event" decreases,
$X$ is not concentrated: $\mathbb{E}X = 1/2$ by symmetry.
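
The example can be checked exactly with binomial coefficients. The sketch below assumes the points are i.i.d. uniform on $[0,1]$ (the text only appeals to symmetry, so this distribution is an assumption), in which case the number $k$ of points above $1/2$ is Binomial$(n, 1/2)$. All function names are illustrative.

```python
from math import comb


def bad_event_probability(n: int) -> float:
    """P(one changed point can flip X), for n i.i.d. uniform points on [0, 1].

    X = 0 when more than floor(n/2) points exceed 1/2, else X = 1.
    A single change moves the count k by at most one, so X can flip only
    when k equals floor(n/2) or floor(n/2) + 1.
    """
    m = n // 2
    return (comb(n, m) + comb(n, m + 1)) / 2 ** n


def mean_of_x(n: int) -> float:
    """E[X] = P(k <= floor(n/2)) with k ~ Binomial(n, 1/2)."""
    m = n // 2
    return sum(comb(n, k) for k in range(m + 1)) / 2 ** n


def variance_of_x(n: int) -> float:
    """X is a 0/1 variable, so Var[X] = E[X] (1 - E[X])."""
    p = mean_of_x(n)
    return p * (1.0 - p)


if __name__ == "__main__":
    for n in (11, 101, 1001, 10001):
        print(n, bad_event_probability(n), mean_of_x(n), variance_of_x(n))
```

Under this assumption the bad-event probability shrinks roughly like $1/\sqrt{n}$, while for odd $n$ the mean stays at $1/2$ and the variance at $1/4$, so $X$ does not concentrate.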


The above example shows that McDiarmid's inequality cannot be extended to
give a useful result for bad sets of measure $\delta = O(1/\sqrt{n})$, while the
extension by Kutin and Niyogi shows that for fast enough rates ($\delta_n$
decreasing faster than $\beta_n/n$, with appropriate $\beta_n$), $X$ is
concentrated around its mean.


3  Concentration  and  Stability 

Let $S = (z_1, \ldots, z_n)$ and
$S^{i,z} = (z_1, \ldots, z_{i-1}, z, z_{i+1}, \ldots, z_n)$. For brevity of
notation, let $f_S$ be the loss function when trained on the set $S$ (i.e.
$V(f_S, z)$ in the notation of [5]). Assume that such functions are upper
bounded by $M$.
Consider the following stability definitions for change-one, that is,
replacement of one point (compare with the analogous $EE_{loo}$ and $E_{loo}$
definitions of [5] for the leave-one-out case):

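Definition 3.1 We say that an algorithm is $(\beta_{emp}, \delta_{emp})$
Empirical Error stable if with probability $1 - \delta_{emp}$ (over the choice
of $S$),

$$\forall z, \quad \left| \frac{1}{n} \sum_{z_j \in S} f_S(z_j) - \frac{1}{n} \sum_{z_j \in S^{i,z}} f_{S^{i,z}}(z_j) \right| \le \beta_{emp}$$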


Definition 3.2 We say that an algorithm is $(\beta_{exp}, \delta_{exp})$
Expected Error stable if with probability $1 - \delta_{exp}$ (over the choice
of $S$),

$$\forall z, \quad \left| \mathbb{E}_u f_S(u) - \mathbb{E}_u f_{S^{i,z}}(u) \right| \le \beta_{exp}$$
By the concentration result of the previous section, if
$\beta_{emp} = o(1/\sqrt{n})$, $\delta_{emp} = o(\beta_{emp}/n)$, the empirical
errors are concentrated around their expected value
$\mathbb{E}_S \frac{1}{n} \sum_{z_j \in S} f_S(z_j) = \mathbb{E}_S f_S(z_i)$,
and if $\beta_{exp} = o(1/\sqrt{n})$, $\delta_{exp} = o(\beta_{exp}/n)$, the
expected errors are concentrated around their expected value
$\mathbb{E}_{S,z} f_S(z)$:

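Proposition 3.1 If an algorithm is $(\beta_{emp}, \delta_{emp})$ Empirical
Error stable, then with probability at least $1 - (2\exp(\cdots) + \cdots)$,

$$\left| \mathbb{E}_S f_S(z_i) - \frac{1}{n} \sum_{z_j \in S} f_S(z_j) \right| \le \epsilon.$$

Proposition 3.2 If an algorithm is $(\beta_{exp}, \delta_{exp})$ Expected Error
stable, then with probability at least $1 - (2\exp(\cdots) + \cdots)$,

$$\left| \mathbb{E}_{S,z} f_S(z) - \mathbb{E}_z f_S(z) \right| \le \epsilon.$$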

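If the above two stability conditions hold,

$$\mathbb{E}_T \left( \mathbb{E}_z f_T(z) - \frac{1}{n} \sum_{z_j \in T} f_T(z_j) \right)^2 \approx \mathbb{E}_T \left[ \mathbb{E}_{S,z} f_S(z) - \mathbb{E}_S f_S(z_i) \right]^2 = \mathbb{E}_T \left[ \mathbb{E}_{S,z} \left( f_S(z) - f_{S^{i,z}}(z) \right) \right]^2$$

and therefore for the second moment to decrease, there must be a condition
forcing

$$\mathbb{E}_{S,z} \left( f_S(z) - f_{S^{i,z}}(z) \right) \to 0.$$

We will call this condition CV-Pseudostability.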

Definition 3.3 We say that an algorithm has $\beta_{ps}$ CV-Pseudostability if

$$\left| \mathbb{E}_{S,z} \left( f_S(z) - f_{S^{i,z}}(z) \right) \right| \le \beta_{ps}$$

This is the analog of the leave-one-out pseudostability defined by [5] in
definition 3.9, which is weaker than their $CV_{loo}$ stability because
$f_S(z) - f_{S^{i,z}}(z)$ has to be small only on average.
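
To make the definition concrete, the sketch below computes the quantity $\mathbb{E}_{S,z}(f_S(z) - f_{S^{i,z}}(z))$ exactly for a toy learning problem by enumerating every outcome. The setting (fair coin-flip data, an algorithm that predicts the training mean, squared-error loss) and the name `cv_pseudostability_gap` are assumptions chosen for illustration only; they do not come from the paper.

```python
from fractions import Fraction
from itertools import product


def cv_pseudostability_gap(n: int, i: int = 0) -> Fraction:
    """Exact E_{S,z}[ f_S(z) - f_{S^{i,z}}(z) ] for a toy algorithm.

    Toy setting (an assumption for illustration): points are fair coin
    flips in {0, 1}, the algorithm predicts the mean of its training set,
    and the loss is the squared error f_S(z) = (mean(S) - z)^2.
    """
    total = Fraction(0)
    outcomes = list(product((0, 1), repeat=n + 1))  # all (S, z) pairs
    for outcome in outcomes:
        sample, z = list(outcome[:n]), outcome[n]
        loss_s = (Fraction(sum(sample), n) - z) ** 2
        replaced = sample.copy()
        replaced[i] = z  # S^{i,z}: the i-th point is replaced by z
        loss_replaced = (Fraction(sum(replaced), n) - z) ** 2
        total += loss_s - loss_replaced
    return total / len(outcomes)


if __name__ == "__main__":
    for n in (2, 4, 8):
        print(n, cv_pseudostability_gap(n))
```

In this toy setting the gap works out to $1/(2n)$, so the algorithm has $\beta_{ps}$ CV-Pseudostability with $\beta_{ps} \to 0$ at rate $1/n$, independently of which index $i$ is replaced.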

Now note that Empirical Error Stability for the removal case implies Empirical
Error Stability for replacement (with appropriate rates), and the same holds
for the Expected Error. Therefore, CV-Pseudostability gives a weak condition
which, together with Error Stability and Empirical Stability with the rates
$\beta_n = o(1/\sqrt{n})$, $\delta_n = o(\beta_n/n)$, implies convergence of the
empirical error to the expected error for any symmetric algorithm.

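To elaborate more on this point, assume the Error Stability (removal) and the
Empirical Stability (removal), where $S^i$ denotes $S$ with the point $z_i$
removed. Because of the Error Stability,
$\mathbb{E}_{S,z} f_S(z) \approx \mathbb{E}_S f_{S^i}(z_i)$.
Also, $\mathbb{E}_{S,z} f_{S^{i,z}}(z) = \mathbb{E}_S f_S(z_i)$. Therefore,
translated into the removal case, the CV-Pseudostability condition becomes

$$\mathbb{E}_S \left( f_{S^i}(z_i) - f_S(z_i) \right) \to 0.$$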
This is exactly $CV_{loo}$ stability without absolute values and was called
pseudoPH stability in definition 3.9 of [5]. We therefore conclude that for the
removal case, Error Stability together with Empirical Stability (with rates
$\beta_n = o(1/\sqrt{n})$, $\delta_n = o(\beta_n/n)$) and pseudoPH stability are
enough for generalization. This result should be compared with Theorem 3.1 of
[5]: here we also obtain generalization by assuming a weaker CV stability but
stronger empirical and expected stability.


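4 Lower Bounds Using Stability


In this section we will lower-bound the second moment

$$\mathbb{E}_S \left( \mathbb{E}_z f_S(z) - \frac{1}{n} \sum_{z_j \in S} f_S(z_j) \right)^2.$$

Clearly, by Jensen's inequality,

$$\begin{aligned}
\mathbb{E}_S \left( \mathbb{E}_z f_S(z) - \frac{1}{n} \sum_{z_j \in S} f_S(z_j) \right)^2
&\ge \left( \mathbb{E}_{S,z} f_S(z) - \mathbb{E}_S \frac{1}{n} \sum_{z_j \in S} f_S(z_j) \right)^2 \\
&= \left[ \mathbb{E}_{S,z} \left( f_S(z) - f_S(z_i) \right) \right]^2 \\
&= \left[ \mathbb{E}_{S,z} \left( f_S(z) - f_{S^{i,z}}(z) \right) \right]^2
\end{aligned}$$

Therefore, convergence of $\mathbb{E}_{S,z} (f_S(z) - f_{S^{i,z}}(z))$ to zero
(CV-Pseudostability) is a necessary condition for the convergence of the
empirical to the expected. For ERM, this condition is also sufficient (see next
section).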

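We now examine the question of necessity of all three stability conditions
posed by [5], but with CV-Pseudostability instead of $CV_{loo}$ as the first
condition. Assume that CV-Pseudostability holds ($\beta_{ps} \to 0$). Assume
additionally that Error Stability holds ($\beta_{err} = o(1/\sqrt{n})$,
$\delta_{err} = o(\beta_{err}/n)$). Then

$$\begin{aligned}
\mathbb{E}_z f_S(z) - \frac{1}{n} \sum_{z_j \in S} f_S(z_j)
&= \left( \mathbb{E}_z f_S(z) - \mathbb{E}_{S,z} f_S(z) \right) \\
&\quad + \left( \mathbb{E}_{S,z} f_S(z) - \mathbb{E}_{S,z} f_{S^{i,z}}(z) \right) \\
&\quad + \left( \mathbb{E}_{S,z} f_{S^{i,z}}(z) - \frac{1}{n} \sum_{z_j \in S} f_S(z_j) \right)
\end{aligned}$$

The first term is bounded by the concentration of expected values around their
mean (follows from Expected Stability and the results of the previous section).
The second term is bounded by CV-Pseudostability. Therefore,

$$\begin{aligned}
\mathbb{E}_z f_S(z) - \frac{1}{n} \sum_{z_j \in S} f_S(z_j)
&\approx \mathbb{E}_{S,z} f_{S^{i,z}}(z) - \frac{1}{n} \sum_{z_j \in S} f_S(z_j) \\
&= \mathbb{E}_T \frac{1}{n} \sum_{z_j \in T} f_T(z_j) - \frac{1}{n} \sum_{z_j \in S} f_S(z_j).
\end{aligned}$$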
Therefore, for the empirical to converge to expected, we must require
concentration of empiricals around their mean. Empirical Stability does imply
this concentration, but it might be possible to have a weaker requirement.
Similarly, if we have CV-Pseudostability and Empirical Stability, we must
require a concentration of expected values around their mean. This is implied
by Expected Stability, but, again, there might be a weaker condition.


5 Empirical Risk Minimization

Proposition 5.1 CV-Pseudostability is necessary and sufficient for consistency
and generalization of any Empirical Risk Minimization algorithm.

Proof: Empirical Risk Minimization searches in the function class $\mathcal{F}$
for a function which minimizes (or $\epsilon$-minimizes) empirical risk. Assume
$f^*$ is the loss function with the smallest expected error, i.e.
$\mathbb{E}_z f^*(z) \le \inf_{g \in \mathcal{F}} \mathbb{E}_z g(z)$. Consider
the shifted loss class
$\mathcal{G} = \mathcal{F} - f^* = \{ g = f - f^* \mid f \in \mathcal{F} \}$.
Let $g_S = f_S - f^*$. Note that if $f_S$ is an empirical minimizer in
$\mathcal{F}$ w.r.t. set $S$, then $g_S$ is the empirical minimizer in
$\mathcal{G}$ w.r.t. $S$.


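$$\begin{aligned}
\mathbb{E}_z f_S(z) - \frac{1}{n} \sum_{z_j \in S} f_S(z_j)
&= \mathbb{E}_z g_S(z) - \frac{1}{n} \sum_{z_j \in S} g_S(z_j) \\
&\quad + \mathbb{E}_z f^*(z) - \frac{1}{n} \sum_{z_j \in S} f^*(z_j)
\end{aligned}$$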

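The second term tends to zero by Hoeffding's inequality. Therefore,
generalization over class $\mathcal{F}$ is equivalent to generalization over
$\mathcal{G}$. Moreover, note that
$\frac{1}{n} \sum_{z_j \in S} g_S(z_j) \le 0$ because the zero function is in
the class $\mathcal{G}$ and that $\mathbb{E}_z g_S(z) \ge 0$ because $f^*$
attains the smallest expected error. Therefore,
$\mathbb{E}_z g_S(z) - \frac{1}{n} \sum_{z_j \in S} g_S(z_j) \ge 0$ and so
convergence
$\mathbb{E}_z g_S(z) - \frac{1}{n} \sum_{z_j \in S} g_S(z_j) \to 0$ is
equivalent to

$$\mathbb{E}_S \left( \mathbb{E}_z g_S(z) - \frac{1}{n} \sum_{z_j \in S} g_S(z_j) \right) \to 0.$$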


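Rewriting,

$$\begin{aligned}
\mathbb{E}_S \left( \mathbb{E}_z g_S(z) - \frac{1}{n} \sum_{z_j \in S} g_S(z_j) \right)
&= \mathbb{E}_{S,z} \left( g_S(z) - g_S(z_i) \right) \\
&= \mathbb{E}_{S,z} \left( g_S(z) - g_{S^{i,z}}(z) \right) \\
&= \mathbb{E}_{S,z} \left( f_S(z) - f_{S^{i,z}}(z) \right)
\end{aligned}$$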


As shown in Theorem 3.4 of [5], $CV_{loo}$-Pseudostability is also equivalent to
generalization and consistency of ERM (for ERM, $CV_{loo}$ pseudostability and
$CV_{loo}$ stability are equivalent).
6 Bounding Generalization Error without Strong Concentration Results

Mukherjee et al. [5] showed that for the replacement case, $CV_{loo}$ stability
together with Expected and Empirical stabilities, is sufficient for
generalization. Their method (similar to that of Devroye and Wagner [3]) bounds
the second moment of the difference of the expected and empirical errors. No
concentration of the errors around their average values is required for this
method.

We now prove a similar result for the replacement case and show that by using
a somewhat stronger definition of CV stability (which is different from but
consistent with our definition of pseudostability) we can prove sufficiency for
generalization using only the expected stability (without empirical stability).

Definition 6.1 We say that an algorithm is $(\beta_{cv}, \delta_{cv})$ strongly
CV stable when

$$\mathbb{P} \left( \forall i, \ |f_S(z) - f_{S^{i,z}}(z)| \le \beta_{cv} \right) \ge 1 - \delta_{cv}$$

Alternative form (by symmetry):

$$\mathbb{P} \left( \forall i, \ |f_{S^{i,z}}(z_i) - f_S(z_i)| \le \beta_{cv} \right) \ge 1 - \delta_{cv}$$

It is crucial that the quantifier $\forall i$ is inside of the probability. Thus
this definition is stronger than $CV_{loo}$ stability (or its change-one
analog). Also note that the probabilities can be taken over $n + 1$ points or
over $n$ points with a fixed $z$.

Proposition 6.1 For any $i \ne j$, with probability at least $1 - \delta_{cv}$,

$$|f_S(z_j) - f_{S^{i,z}}(z_j)| \le 2\beta_{cv}.$$

Proof:

$$|f_S(z_j) - f_{S^{i,z}}(z_j)| \le |f_S(z_j) - f_{S^{j,z}}(z_j)| + |f_{S^{j,z}}(z_j) - f_{S^{i,z}}(z_j)|$$

Both terms above are bounded by CV stability. Indeed, in the first term, we're
starting with the set $S^{j,z}$ which does not contain $z_j$ and replacing it
by the set $(S^{j,z})^{j,z_j} = S$ which contains it. In the second term, we're
starting with the set $S^{j,z}$ which does not contain $z_j$ and replacing it
by the set $(S^{j,z})^{i,z_j} = S^{i,z}$ which contains it.


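Proposition 6.2 Strong CV stability implies that

$$\left| \frac{1}{n} \sum_{z_i \in S} f_S(z_i) - \frac{1}{n} \sum_{z_i \in S} f_{S^{i,z}}(z_i) \right| \le \beta_{cv}$$

with probability $1 - \delta_{cv}$.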
Proof:

$$\left| \frac{1}{n} \sum_{z_i \in S} f_S(z_i) - \frac{1}{n} \sum_{z_i \in S} f_{S^{i,z}}(z_i) \right| \le \frac{1}{n} \sum_{z_i \in S} |f_S(z_i) - f_{S^{i,z}}(z_i)| \le \beta_{cv}$$

The reason the probability of this event does not increase is due to the way we
defined strong CV stability. The pair $S, z$ is "good" with probability
$1 - \delta_{cv}$ and then any coordinate $i$ can be changed.

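Proposition 6.3 Strong CV stability and Expected Error stability imply
generalization. More precisely,

$$\mathbb{E}_S \left( \mathbb{E}_z f_S(z) - \frac{1}{n} \sum_{z_i \in S} f_S(z_i) \right)^2 \le M \left( \ldots + 3M\delta_{cv} + 3\beta_{exp} + 2M\delta_{exp} + 1/n \right)$$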

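Proof:

$$\begin{aligned}
\mathbb{E}_S \left( \mathbb{E}_z f_S(z) - \frac{1}{n} \sum_{z_i \in S} f_S(z_i) \right)^2
&= \mathbb{E}_S \left[ \mathbb{E}_z f_S(z) \, \mathbb{E}_{z'} f_S(z') \right]
 - \mathbb{E}_S \left[ \mathbb{E}_z f_S(z) \, \frac{1}{n} \sum_{z_i \in S} f_S(z_i) \right] \\
&\quad + \mathbb{E}_S \left[ \frac{1}{n} \sum_{z_i \in S} f_S(z_i) \, \frac{1}{n} \sum_{z_j \in S} f_S(z_j) \right]
 - \mathbb{E}_S \left[ \mathbb{E}_z f_S(z) \, \frac{1}{n} \sum_{z_i \in S} f_S(z_i) \right]
\end{aligned}$$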

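First,

$$\begin{aligned}
\mathbb{E}_S \left[ \mathbb{E}_z f_S(z) \left( \frac{1}{n} \sum_{z_i \in S} f_S(z_i) \right) \right]
&= \mathbb{E}_S \mathbb{E}_z \left( f_S(z) \, \frac{1}{n} \sum_{z_i \in S} f_S(z_i) \right) \\
&= \mathbb{E}_{S,z} \left( f_S(z) f_S(z_i) \right)
\end{aligned}$$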


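Second,

$$\begin{aligned}
\mathbb{E}_S \mathbb{E}_z f_S(z) \, \mathbb{E}_{z'} f_S(z')
&= \mathbb{E}_S \left( \mathbb{E}_z f_S(z) \, \mathbb{E}_{z'} f_S(z') \right) - \mathbb{E}_S \left( \mathbb{E}_z f_S(z) \, \mathbb{E}_{z'} f_{S^{i,z}}(z') \right) \\
&\quad + \mathbb{E}_S \left( \mathbb{E}_z f_S(z) \, \mathbb{E}_{z'} f_{S^{i,z}}(z') \right) - \mathbb{E}_{S,z,z'} \left( f_{S^{i,z}}(z) f_{S^{i,z}}(z') \right) \\
&\quad + \mathbb{E}_{S,z,z'} \left( f_{S^{i,z}}(z) f_{S^{i,z}}(z') \right) \\
&\le M(\beta_{cv} + M\delta_{cv}) + M(\beta_{exp} + M\delta_{exp}) + \mathbb{E}_{S,z} f_S(z) f_S(z_i)
\end{aligned}$$

We bound the first term using CV stability, the second using Expected Error
stability, and use symmetry at the last step.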

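Finally, by Proposition 6.2,

$$\mathbb{E}_S \left[ \frac{1}{n} \sum_{z_i \in S} f_S(z_i) \, \frac{1}{n} \sum_{z_j \in S} f_S(z_j) \right] \le M(\beta_{cv} + M\delta_{cv}) + \mathbb{E}_{S,z} \left[ \frac{1}{n} \sum_{z_i \in S} f_{S^{i,z}}(z_i) \, \frac{1}{n} \sum_{z_j \in S} f_S(z_j) \right]$$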
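Furthermore, for $i \ne j$, by symmetry,

$$\begin{aligned}
\mathbb{E}_{S,z} \left[ \frac{1}{n} \sum_{z_i \in S} f_{S^{i,z}}(z_i) \, \frac{1}{n} \sum_{z_j \in S} f_S(z_j) \right]
&= \frac{n-1}{n} \, \mathbb{E}_{S,z} \left( f_{S^{i,z}}(z_i) f_S(z_j) \right) + \frac{1}{n} \, \mathbb{E}_{S,z} \left( f_{S^{i,z}}(z_i) f_S(z_i) \right) \\
&\le \mathbb{E}_{S,z} \left( f_{S^{i,z}}(z_i) f_S(z_j) \right) + M/n.
\end{aligned}$$

Now, by symmetry,
$\mathbb{E}_{S,z} f_{S^{i,z}}(z_i) f_S(z_j) = \mathbb{E}_{S,z} f_S(z) f_{S^{i,z}}(z_j)$.
By Proposition 6.1, with probability $1 - \delta_{cv}$,
$|f_{S^{i,z}}(z_j) - f_S(z_j)| \le 2\beta_{cv}$. Therefore,

$$\mathbb{E}_{S,z} \left( f_{S^{i,z}}(z_i) f_S(z_j) \right) \le M(2\beta_{cv} + M\delta_{cv}) + \mathbb{E}_{S,z} \left( f_S(z) f_S(z_j) \right).$$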

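Putting it together,

$$\mathbb{E}_S \left[ \frac{1}{n} \sum_{z_i \in S} f_S(z_i) \, \frac{1}{n} \sum_{z_j \in S} f_S(z_j) \right] \le \mathbb{E}_{S,z} \left( f_S(z) f_S(z_j) \right) + M(3\beta_{cv} + 2M\delta_{cv}) + M/n$$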


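The grand total is:

$$\mathbb{E}_S \left( \mathbb{E}_z f_S(z) - \frac{1}{n} \sum_{z_i \in S} f_S(z_i) \right)^2 \le M \left( \ldots + 3M\delta_{cv} + 3\beta_{exp} + 2M\delta_{exp} + 1/n \right)$$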


7 Remarks

This paper clarifies a few questions left open by previous work on stability
and specifically by [6, 5]. In particular, it clarifies that if pseudostability
holds, neither empirical nor expected error stability alone is sufficient to
ensure generalization. More importantly, it shows that there exist several
alternative stability conditions which are sufficient for generalization in
general and are all equivalent to generalization and consistency of ERM.
Figure 1: An overall view of some of the properties discussed in [5].
Figure 2: The two main new results of this paper.